ABSTRACT

Thanks to information explosion, data for the objects of interest
can be collected from increasingly more sources. However, for the
same object, there usually exist conflicts among the collected
multi-source information. To tackle this challenge, truth discovery,
which integrates multi-source noisy information by estimating the
reliability of each source, has emerged as a hot topic. Several truth
discovery methods have been proposed for various scenarios, and they
have been successfully applied in diverse application domains. In
this survey, we focus on providing a comprehensive overview of truth
discovery methods, and summarizing them from different aspects. We
also discuss some future directions of truth discovery research. We
hope that this survey will promote a better understanding of the
current progress on truth discovery, and offer some guidelines on how
to apply these approaches in application domains.

1. INTRODUCTION 

In the era of information explosion, data have been woven into every
aspect of our lives, and we are continuously generating data through
a variety of channels, such as social networks, blogs, discussion
forums, crowdsourcing platforms, etc. These data are analyzed at both
individual and population levels, by businesses for aggregating
opinions and recommending products, by governments for decision
making and security checks, and by researchers for discovering new
knowledge. In these scenarios, data, even describing the same object
or event, can come from a variety of sources. However, the collected
pieces of information about the same object may conflict with each
other due to errors, missing records, typos, out-of-date data, etc.
For example, the top search results returned by Google for the query
“the height of Mount Everest” include “29,035 feet”, “29,002 feet”
and “29,029 feet”. Among these pieces of noisy information, which one
is more trustworthy, or represents the true fact? In this and many
more similar problems, it is essential to aggregate noisy information
about the same set of objects or events collected from various
sources to get true facts.

One straightforward approach to eliminate conflicts among
multi-source data is to conduct majority voting or averaging. The
biggest shortcoming of such voting/averaging approaches is that they
assume all the sources are equally reliable. Unfortunately, this
assumption may not hold in most cases. Consider the aforementioned
“Mount Everest” example: Using majority voting, the result “29,035
feet”, which has the highest number of occurrences, will be regarded
as the truth. However, in the search results, the information
“29,029 feet” from Wikipedia is the truth. This example reveals that
information quality varies a lot among different sources, and the
accuracy of aggregated results can be improved by capturing the
reliabilities of sources. The challenge is that source reliability is
usually unknown a priori in practice and has to be inferred from the
data.

In light of this challenge, the topic of truth discovery
[15, 24, 29, 32, 46, 48, 74, 77, 78] has gained increasing popularity
recently due to its ability to estimate source reliability degrees
and infer true information. As truth discovery methods usually work
without any supervision, the source reliability can only be inferred
based on the data. Thus in existing work, the source reliability
estimation and truth finding steps are tightly combined through the
following principle: The sources that provide true information more
often will be assigned higher reliability degrees, and the
information that is supported by reliable sources will be regarded as
truths.

With this general principle, truth discovery approaches have been
proposed to fit various scenarios, and they make different
assumptions about input data, source relations, identified truths,
etc. Due to this diversity, it may not be easy for people to compare
these approaches and choose an appropriate one for their tasks. In
this survey paper, we give a comprehensive overview of truth
discovery approaches, summarize them from different aspects, and
discuss the key challenges in truth discovery. To the best of our
knowledge, this is the first comprehensive survey on truth discovery.
We hope that this survey paper can provide a useful resource about
truth discovery, help people advance its frontiers, and give some
guidelines to apply truth discovery in real-world tasks.

Truth discovery plays a prominent part in the information age. On one
hand we need accurate information more than ever, but on the other
hand inconsistent information is inevitable due to the “variety”
feature of big data. The development of truth discovery can benefit
many applications in different fields where critical decisions have
to be made based on the reliable information extracted from diverse
sources. Examples include healthcare [42], crowd/social sensing
[5, 26, 40, 60, 67, 68], crowdsourcing [6, 28, 71], information
extraction [27, 76], knowledge graph construction [17, 18], and so
on. These and other applications demonstrate the broader impact of
truth discovery on multi-source information integration.

The rest of this survey is organized as follows: In the next section,
we first formally define the truth discovery task. Then the general
principle of truth discovery is illustrated through a concrete
example, and three popular ways to capture this principle are
presented. After that, in Section 3, components of truth discovery
are examined from five aspects, which cover most of the truth
discovery scenarios considered in the literature. Under these
aspects, representative truth discovery approaches are compared in
Section 4. We further discuss some future directions of truth
discovery research in Section 5 and introduce several applications in
Section 6. In Section 7, we briefly mention some related areas of
truth discovery. Finally, this survey is concluded in Section 8.


2. OVERVIEW 

Truth discovery is motivated by the strong need to resolve conflicts
among multi-source noisy information, since conflicts are commonly
observed in databases [9, 10, 20], the Web [38, 72], crowdsourced
data [57, 58, 81], etc. In contrast to the voting/averaging
approaches that treat all information sources equally, truth
discovery aims to infer source reliability degrees, by which
trustworthy information can be discovered. In this section, after
formally defining the task, we discuss the general principle of
source reliability estimation and illustrate the principle with an
example. Further, three popular ways to capture this principle are
presented and compared.


2.1 Task Definition 

To make the following description clear and consistent, in 
this section, we introduce some definitions and notations 
that are used in this survey. 


• An object $o$ is a thing of interest, a source $s$ describes
the place where the information about objects can be
collected from, and a value $v_o^s$ represents the information
provided by source $s$ about object $o$.

• An observation, also known as a record, is a 3-tuple
that consists of an object, a source, and its provided
value.

• The identified truth $v_o^*$ for an object $o$ is the information
selected as the most trustworthy one from all possible
candidate values about this object.

• Source weight $w_s$ reflects the probability of source $s$
providing trustworthy information. A higher $w_s$ indicates
that source $s$ is more reliable and the information from
this source is more likely to be accurate.
Note that in this survey, the terms “source weight” and
“source reliability degree” are used interchangeably.


Based on these definitions and notations, let’s formally define the
truth discovery task as follows.

Definition 1. For a set of objects $\mathcal{O}$ that we are
interested in, related information can be collected from a set of
sources $\mathcal{S}$. Our goal is to find the truth $v_o^*$ for each
object $o \in \mathcal{O}$ by resolving the conflicts among the
information from different sources $\{v_o^s\}_{s \in \mathcal{S}}$.
Meanwhile, truth discovery methods estimate source weights
$\{w_s\}_{s \in \mathcal{S}}$ that will be used to infer truths.

2.2 General Principle of Truth Discovery 

In this section, we discuss the general principle adopted by 
truth discovery approaches, and then describe three popular 
ways to model it in practice. After comparing these three 
ways, a general truth discovery procedure is given. 

As mentioned in Section 1, the most important feature of truth
discovery is its ability to estimate source reliabilities. To
identify the trustworthy information (truths), weighted aggregation
of the multi-source data is performed based on the estimated source
reliabilities. As both source reliabilities and truths are unknown,
the general principle of truth discovery works as follows: If a
source provides trustworthy information frequently, it will be
assigned a high reliability; meanwhile, if one piece of information
is supported by sources with high reliabilities, it will have a high
chance of being selected as the truth.

To better illustrate this principle, consider the example in
Table 1. There are three sources providing the birthplace
information of six politicians (objects), and the goal is to
infer the true birthplace for each politician based on the
conflicting multi-source information. The last two rows in
Table 1 show the identified birthplaces by majority voting
and truth discovery respectively.

By comparing the results given by majority voting and truth discovery
in Table 1, we can see that truth discovery methods outperform
majority voting in the following cases: (1) If a tie case is observed
in the information provided by sources for an object, majority voting
can only randomly select one value to break the tie because each
candidate value receives equal votes. However, truth discovery
methods are able to distinguish sources by estimating their
reliability degrees. Therefore, they can easily break the tie and
output the value obtained by weighted aggregation. In this running
example, Mahatma Gandhi represents a tie case. (2) More importantly,
truth discovery approaches are able to output a minority value as the
aggregated result. Let’s take a close look at Barack Obama in
Table 1. The final aggregated result given by truth discovery is
“Hawaii”, which is a minority value provided by only Source 2 among
these three sources. If most sources are unreliable and they provide
the same incorrect information (consider the “Mount Everest” example
in Section 1), majority voting has no chance to select the correct
value as it is claimed by a minority. In contrast, truth discovery
can distinguish reliable and unreliable sources by inferring their
reliability degrees, so it is able to derive the correct information
by conducting weighted aggregation. As a result, truth discovery
approaches are labeled as the methods that can discover “the wisdom
of minority” [28, 76]. Next, we discuss three popular ways to
incorporate this general principle in truth discovery methods.
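
The contrast between majority voting and weighted aggregation described above can be sketched in a few lines of Python. This is only an illustration: the claims and the source weights below are hypothetical values chosen for the example (they are not the contents of Table 1), and the function names are our own.

```python
from collections import defaultdict


def majority_vote(claims):
    """Return the value claimed by the most sources.

    claims: dict mapping source name -> claimed value.
    Ties are broken by insertion order, which is arbitrary.
    """
    counts = defaultdict(int)
    for value in claims.values():
        counts[value] += 1
    return max(counts, key=counts.get)


def weighted_vote(claims, weights):
    """Return the value with the largest total source weight."""
    scores = defaultdict(float)
    for source, value in claims.items():
        scores[value] += weights[source]
    return max(scores, key=scores.get)


# Hypothetical claims about one object and hypothetical reliabilities.
claims = {"source_1": "Kenya", "source_2": "Hawaii", "source_3": "Kenya"}
weights = {"source_1": 0.2, "source_2": 0.9, "source_3": 0.3}

print(majority_vote(claims))           # value backed by two of three sources
print(weighted_vote(claims, weights))  # value backed by the most reliable source
```

With these assumed weights, the minority value wins the weighted vote because the single source supporting it outweighs the two sources supporting the other value.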


2.2.1 Iterative methods 

In the general principle of truth discovery, the truth computation
and source reliability estimation depend on each other. Thus some
truth discovery methods [15, 24, 46, 74] are designed as iterative
procedures, in which the truth computation step and source weight
estimation step are iteratively conducted until convergence.

Table 1: Illustrative Example

In the truth computation step, sources’ weights are assumed to be
fixed. Then the truths $\{v_o^*\}_{o \in \mathcal{O}}$ can be inferred
through weighted aggregation such as weighted voting. For example, in
Investment [46], the sources uniformly “invest” their reliabilities
among their claimed values, and the truths are identified by weighted
voting. To be more specific, each candidate value $v$ receives the
votes from sources in the following way:

$$vote(v) = \left( \sum_{s \in S_v} \frac{w_s}{|V_s|} \right)^{1.2}, \tag{1}$$

where $S_v$ is the set of sources that provide this candidate value,
and $|V_s|$ is the number of claims made by source $s$. As the truths
are identified by ranking the received votes, the final results
$\{v_o^*\}_{o \in \mathcal{O}}$ rely more on the sources with high
weights. This follows the principle that the information from
reliable sources will be counted more in the aggregation.

In the source weight estimation step, source weights are estimated
based on the current identified truths. Let’s still take Investment
for example. In the truth computation step, the sources invest their
reliabilities among claimed values, and now, in the source weight
computation step, they collect credits back from the identified
truths as follows:

$$w_s = \sum_{v \in V_s} \left( vote(v) \cdot \frac{w_s / |V_s|}{\sum_{s' \in S_v} w_{s'} / |V_{s'}|} \right). \tag{2}$$


That is, each source gets the proportional credits back from all its
claimed values. As the received votes of values grow according to a
non-linear function (Eq. (1)), the trustworthy information will have
higher votes and contribute more to the source weight estimation.
Thus the sources that provide trustworthy information more often will
get more credits back and have higher weights.
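
The two alternating steps of an Investment-style method can be sketched as follows. This is a minimal illustration under stated assumptions, not a reference implementation: the exponent 1.2 for the non-linear vote growth, the fixed number of iterations, and the rescaling of weights by their maximum (added only to keep numbers bounded) are our assumptions, as are the function name and the toy claims.

```python
from collections import defaultdict


def investment(claims, iterations=20, exponent=1.2):
    """Sketch of an Investment-style iteration.

    claims: dict mapping source -> dict mapping object -> claimed value.
    Every source is assumed to make at least one claim.
    Returns (truths, weights).
    """
    weights = {s: 1.0 for s in claims}
    votes = {}
    for _ in range(iterations):
        # Truth computation step: each source splits its weight
        # uniformly over its claims; votes grow non-linearly.
        invested = defaultdict(float)
        for s, by_obj in claims.items():
            share = weights[s] / len(by_obj)
            for o, v in by_obj.items():
                invested[(o, v)] += share
        votes = {key: total ** exponent for key, total in invested.items()}

        # Source weight estimation step: each source collects credit
        # in proportion to what it invested in each claimed value.
        new_weights = {}
        for s, by_obj in claims.items():
            share = weights[s] / len(by_obj)
            new_weights[s] = sum(
                votes[(o, v)] * share / invested[(o, v)]
                for o, v in by_obj.items()
                if invested[(o, v)] > 0
            )
        # Assumption: rescale so the largest weight is 1 (ranking unchanged).
        top = max(new_weights.values())
        weights = {s: w / top for s, w in new_weights.items()}

    # Pick, for each object, the candidate value with the highest vote.
    best = {}
    for (o, v), score in votes.items():
        if o not in best or score > best[o][1]:
            best[o] = (v, score)
    truths = {o: v for o, (v, _) in best.items()}
    return truths, weights


# Hypothetical toy data: source B agrees with the majority on every object.
toy_claims = {
    "A": {"x": 1, "y": 1, "z": 1},
    "B": {"x": 1, "y": 1, "z": 2},
    "C": {"x": 2, "y": 2, "z": 2},
}
toy_truths, toy_weights = investment(toy_claims)
print(toy_truths, toy_weights)
```

On this toy input the source that most often agrees with the selected values ends up with the largest weight, which is the behavior the general principle asks for.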


2.2.2 Optimization based methods

In [6, 29, 30, 32], the general principle of truth discovery is
captured through the following optimization formulation:

$$\arg\min_{\{w_s\},\{v_o^*\}} \sum_{o \in \mathcal{O}} \sum_{s \in \mathcal{S}} w_s \cdot d(v_o^s, v_o^*), \tag{3}$$

where $d(\cdot)$ is a distance function that measures the difference
between the information provided by source $s$ and the identified
truth. For example, the distance function can be the 0-1 loss
function for categorical data, and the $L_2$-norm can be adopted for
continuous data. The objective function measures the weighted
distance between the provided information $\{v_o^s\}$ and the
identified truths $\{v_o^*\}$. By minimizing this function, the
aggregated results $\{v_o^*\}$ will be closer to the information from
the sources with high weights. Meanwhile, if a source provides
information that is far from the aggregated results, in order to
minimize the total loss, it will be assigned a low weight. These
ideas exactly follow the general principle of truth discovery.

In this optimization formulation (Eq. (3)), two sets of variables,
source weights $\{w_s\}$ and identified truths $\{v_o^*\}$, are
involved. To derive a solution, coordinate descent [7] can be
adopted, in which one set of variables is fixed in order to solve for
the other set of variables. This leads to solutions in which the
truth computation step and source weight estimation step are
iteratively conducted until convergence. This is similar to the
iterative methods.
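
The coordinate descent idea can be sketched for continuous data with a squared distance. Two points are assumptions on our part and should be read as such. First, the objective as written needs some constraint on the weights (otherwise setting all weights to zero is trivially optimal); the sketch uses a negative-log update on each source's share of the total loss, a choice found in methods of this family but not spelled out in the text above. Second, the small constant `eps` and the function name are ours.

```python
import math


def coordinate_descent_td(claims, iterations=10, eps=1e-12):
    """Alternate truth and weight updates for continuous claims.

    claims: dict mapping source -> dict mapping object -> float.
    Assumes at least two sources. Returns (truths, weights).
    """
    sources = list(claims)
    objects = sorted({o for by_obj in claims.values() for o in by_obj})
    weights = {s: 1.0 for s in sources}
    truths = {}
    for _ in range(iterations):
        # Fix weights, solve for truths: with squared distance the
        # minimizer is the weighted mean of the claimed values.
        for o in objects:
            num = sum(weights[s] * claims[s][o] for s in sources if o in claims[s])
            den = sum(weights[s] for s in sources if o in claims[s])
            truths[o] = num / den
        # Fix truths, solve for weights (assumed regularized update):
        # a source with a larger share of the total loss gets a lower weight.
        losses = {
            s: sum((v - truths[o]) ** 2 for o, v in claims[s].items()) + eps
            for s in sources
        }
        total = sum(losses.values())
        weights = {s: max(-math.log(losses[s] / total), eps) for s in sources}
    return truths, weights


# Hypothetical measurements: sources "a" and "b" roughly agree, "c" is far off.
example = {
    "a": {"x": 10.0, "y": 20.0},
    "b": {"x": 10.2, "y": 19.8},
    "c": {"x": 15.0, "y": 30.0},
}
est_truths, est_weights = coordinate_descent_td(example)
print(est_truths, est_weights)
```

Compared with a plain average, the estimate for each object is pulled toward the two sources that agree with each other, and the outlying source receives the smallest weight.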

2.2.3 Probabilistic graphical model based methods 
Some truth discovery methods [48, 77, 78] are based on probabilistic
graphical models (PGMs). Figure 1 shows a general PGM to incorporate
the principle of truth discovery, and the corresponding likelihood is
as follows:

$$\prod_{s \in \mathcal{S}} p(w_s \mid \beta) \prod_{o \in \mathcal{O}} \left( p(v_o^* \mid \alpha) \prod_{s \in \mathcal{S}} p(v_o^s \mid v_o^*, w_s) \right). \tag{4}$$



Figure 1: The general probabilistic graphical model for truth 
discovery 

In this model, each claimed value $v_o^s$ is generated based on the
corresponding truth $v_o^*$ and source weight $w_s$, and function
$p(v_o^s \mid v_o^*, w_s)$ links them together. Let’s take the
Gaussian distribution as an example. In this case, the truth $v_o^*$
can be set as the distribution mean, while source weight $w_s$ is the
precision (the reciprocal of variance). Then the claimed value
$v_o^s$ is sampled from this particular distribution with parameters
$(v_o^*, w_s)$. If source weight $w_s$ is high, the variance is
relatively small and thus the “generated” claimed value $v_o^s$ will
be close to the truth $v_o^*$. In other words, the truth $v_o^*$ is
close to the claimed values that are supported by the sources with
high weights. Meanwhile, if the claimed value $v_o^s$ is close to the
identified truth $v_o^*$, in order to maximize the likelihood, a
small variance parameter (high source weight $w_s$) will be
estimated. This also reflects the general principle of truth
discovery.
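
The Gaussian example can be made concrete with a short worked derivation. It is a sketch under two assumptions that are ours: the priors on truths and weights are ignored (treated as flat), and $\mathcal{S}_o$ and $\mathcal{O}_s$ denote, respectively, the sources that make a claim about object $o$ and the objects covered by source $s$.

```latex
% Log-likelihood of the claims when v_o^s ~ N(v_o^*, 1 / w_s):
\log L = \sum_{o \in \mathcal{O}} \sum_{s \in \mathcal{S}_o}
         \left[ \tfrac{1}{2} \log w_s
                - \tfrac{w_s}{2} \left( v_o^s - v_o^* \right)^2 \right]
         + \text{const}

% Setting the derivative with respect to v_o^* to zero:
\sum_{s \in \mathcal{S}_o} w_s \left( v_o^s - v_o^* \right) = 0
\;\Longrightarrow\;
v_o^* = \frac{\sum_{s \in \mathcal{S}_o} w_s \, v_o^s}
             {\sum_{s \in \mathcal{S}_o} w_s}

% Setting the derivative with respect to w_s to zero:
\sum_{o \in \mathcal{O}_s}
  \left[ \frac{1}{2 w_s} - \tfrac{1}{2} \left( v_o^s - v_o^* \right)^2 \right] = 0
\;\Longrightarrow\;
w_s = \frac{|\mathcal{O}_s|}
           {\sum_{o \in \mathcal{O}_s} \left( v_o^s - v_o^* \right)^2}
```

Under these assumptions the truth is a precision-weighted mean of the claims, and a source's weight is the inverse of its mean squared deviation from the truths, which matches the intuition described in the text.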

To infer the latent variables $\{w_s\}$ and $\{v_o^*\}$, techniques
such as Expectation Maximization (EM) can be adopted for inference.
The hyperparameters $\alpha$ and $\beta$ also exert their influence
on the inference. These parameters can be used to incorporate prior
knowledge about truth distribution or source weight distribution.

2.2.4 Comparison and summary 

Here we compare the above three ways to capture the general 
principle of truth discovery. 


• In terms of interpretability, iterative methods are easier
to understand and interpret. For example, in Investment and
PooledInvestment [46, 47], sources “invest” their reliabilities
among their claimed values and collect credits back from the
identified truths. Optimization based and PGM based solutions,
which are derived from coordinate descent and inference, are
also interpretable but need more explanation.

• When some prior knowledge about sources or objects
is available, it can be used to improve the truth discovery
performance. In optimization based solutions,
prior knowledge can be formulated as extra equality
or inequality constraints. For PGM based methods,
hyperparameters can capture the external knowledge.

All these three ways are widely used to encode the general principle
into truth discovery methods. By providing the above comparison, we
do not claim that one of them is better than another. In fact, the
coordinate descent in optimization based approaches and the parameter
inference in PGM based approaches lead to iterative update rules that
are similar to iterative methods, and it is possible to formulate the
iterative methods as an optimization problem or a parameter inference
task.

The general procedure of truth discovery is summarized in
Algorithm 1. Usually, truth discovery methods start with an
initialization of source weights, and then iteratively conduct the
truth computation step and the source weight estimation step.
Some stopping criteria are adopted in practice to control the
number of iterations. One commonly adopted criterion is to
check the change of identified truths or source weights, and
terminate the algorithm when the change is smaller than a
pre-defined threshold.


Algorithm 1 General Procedure of Truth Discovery

Input: Information from sources $\{v_o^s\}_{o \in \mathcal{O}, s \in \mathcal{S}}$.
Output: Identified truths $\{v_o^*\}_{o \in \mathcal{O}}$ and the estimated
        source weights $\{w_s\}_{s \in \mathcal{S}}$.

1: Initialize source weights $\{w_s\}_{s \in \mathcal{S}}$;
2: repeat
3:   for $o \leftarrow 1$ to $|\mathcal{O}|$ do
4:     Truth computation: infer the truth for object $o$
       based on the current estimation of source weights;
5:   end for
6:   Source weight estimation: update source weights
     based on the current identified truths;
7: until Stop criterion is satisfied;
8: return Identified truths $\{v_o^*\}_{o \in \mathcal{O}}$ and the estimated
   source weights $\{w_s\}_{s \in \mathcal{S}}$.
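
The general procedure, including the threshold-based stopping criterion, can be sketched as a small driver function into which any truth computation step and weight estimation step are plugged. The two example steps below (weighted voting, and weight as the fraction of a source's claims that match the current truths) are deliberately simple placeholders of our own, not the update rules of any specific published method; all names and default values are assumptions.

```python
from collections import defaultdict


def truth_discovery(observations, truth_step, weight_step,
                    threshold=1e-6, max_iter=100):
    """Generic loop: alternate the two steps until weights stop changing.

    observations: list of (object, source, value) tuples.
    """
    sources = {s for (_, s, _) in observations}
    weights = {s: 1.0 for s in sources}   # initialization of source weights
    truths = {}
    for _ in range(max_iter):
        truths = truth_step(observations, weights)
        new_weights = weight_step(observations, truths)
        change = max(abs(new_weights[s] - weights[s]) for s in sources)
        weights = new_weights
        if change < threshold:            # stopping criterion
            break
    return truths, weights


def weighted_voting_step(observations, weights):
    """Placeholder truth step: pick the value with the largest total weight."""
    scores = defaultdict(lambda: defaultdict(float))
    for obj, src, val in observations:
        scores[obj][val] += weights[src]
    return {obj: max(vals, key=vals.get) for obj, vals in scores.items()}


def accuracy_step(observations, truths):
    """Placeholder weight step: fraction of a source's claims matching truths."""
    hits, totals = defaultdict(int), defaultdict(int)
    for obj, src, val in observations:
        totals[src] += 1
        hits[src] += int(truths[obj] == val)
    return {src: hits[src] / totals[src] for src in totals}


# Hypothetical observations from three sources about three objects.
obs = [
    ("x", "A", 1), ("y", "A", 1), ("z", "A", 1),
    ("x", "B", 1), ("y", "B", 1), ("z", "B", 2),
    ("x", "C", 2), ("y", "C", 2), ("z", "C", 2),
]
print(truth_discovery(obs, weighted_voting_step, accuracy_step))
```

Separating the loop from the two steps mirrors the structure of Algorithm 1 and makes it easy to swap in other update rules while keeping the same stopping logic.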


3. ASPECTS OF TRUTH DISCOVERY

As mentioned above, various truth discovery approaches have been
proposed for several scenarios. They have different assumptions about
input data, relations among sources and objects, identified truths,
etc. Meanwhile, applications in various domains also have their
unique characteristics that should be taken into account. These
motivate the need for diverse truth discovery techniques. In this
section, we summarize them from the following aspects: input data,
source reliability, object, claimed value, and the output.

3.1 Input Data

We first discuss several features of input data that require
pre-processing before the truth discovery step.

3.1.1 Duplicate input data

It is possible that one source may make several observations about
the same object. For example, a Wikipedia contributor may edit the
information about the same entry several times, or a crowdsourcing
worker may submit output for a specific task in multiple attempts.
However, most of the truth discovery methods assume that each source
makes at most one observation about an object. If the timestamp for
each observation is available, a possible approach is to consider the
data freshness [52] and select the up-to-date observation. Otherwise,
some pre-defined rules can be adopted to choose one from multiple
observations.

3.1.2 Objects without conflict

For some objects, all the observations made by sources have the same
claimed value. In this case, most of the truth discovery methods
should give the same result, which is the claimed value (one
exception is the method that considers “unknown” as an output
candidate [46]). So it might be safe to remove these trivial records.
Furthermore, in [77], the authors report that this pre-processing
improves the effectiveness of truth discovery methods. This is
because if all the sources agree with each other, these observations
may not contribute much to the source reliability estimation.
However, it should be pointed out that these trivial records do
affect the estimation of source reliability, and thus this
pre-processing step should be carefully examined before performing
it.
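
The two pre-processing steps for duplicate observations and for objects without conflict can be sketched as follows. This is a hypothetical helper pair, assuming each raw record carries a comparable timestamp; the record layout and function names are our own.

```python
from collections import defaultdict


def latest_observations(records):
    """Keep only the freshest observation per (object, source) pair.

    records: iterable of (object, source, value, timestamp) tuples.
    Returns a list of (object, source, value) observations.
    """
    latest = {}
    for obj, src, val, ts in records:
        key = (obj, src)
        if key not in latest or ts > latest[key][1]:
            latest[key] = (val, ts)
    return [(obj, src, val) for (obj, src), (val, _) in latest.items()]


def split_conflicting(observations):
    """Separate objects with conflicting claims from unanimous ones.

    Returns (conflicting_observations, agreed_values), where agreed_values
    maps each conflict-free object to its single claimed value.
    """
    values = defaultdict(set)
    for obj, _, val in observations:
        values[obj].add(val)
    conflicting = [rec for rec in observations if len(values[rec[0]]) > 1]
    agreed = {obj: next(iter(vals)) for obj, vals in values.items() if len(vals) == 1}
    return conflicting, agreed


# Hypothetical raw records: source A revised its claim about object "x".
raw = [
    ("x", "A", 1, 10), ("x", "A", 2, 20), ("x", "B", 3, 5),
    ("y", "A", 7, 1), ("y", "B", 7, 2),
]
clean = latest_observations(raw)
conflicting, agreed = split_conflicting(clean)
print(conflicting, agreed)
```

Whether the conflict-free records are actually dropped is a design decision: since they still carry evidence about source reliability, a cautious pipeline might keep them for weight estimation and only skip them when selecting truths.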


3.1.3 Input data format 

As information is collected from various sources, it may come in
different formats [31]. For example, when the object is “the height
of Mount Everest”, some sources have claimed values as “29,029 feet”,
while others have claimed values as “8848 meters”. Another case, for
example, “John Smith” and “Smith, John”, is commonly observed in text
data. In fact, these claimed values are the same and they should be
formatted to an identical value.
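
A minimal sketch of such format normalization is shown below for the two examples mentioned above. The accepted unit spellings, the rounding to whole meters, and the function names are assumptions made for illustration; real data would need a much richer set of rules.

```python
import re

FEET_TO_METERS = 0.3048  # exact definition of the international foot


def normalize_height(text):
    """Convert strings such as '29,029 feet' or '8848 meters' to whole meters."""
    match = re.fullmatch(
        r"\s*([\d,]+(?:\.\d+)?)\s*(feet|ft|meters|m)\s*", text.lower()
    )
    if match is None:
        raise ValueError(f"unrecognized height format: {text!r}")
    number = float(match.group(1).replace(",", ""))
    unit = match.group(2)
    meters = number * FEET_TO_METERS if unit in ("feet", "ft") else number
    # Assumption: rounding to the nearest meter is precise enough to
    # treat the two claims as the same value.
    return round(meters)


def normalize_name(text):
    """Rewrite 'Last, First' as 'First Last' and collapse extra spaces."""
    if "," in text:
        last, first = [part.strip() for part in text.split(",", 1)]
        return f"{first} {last}"
    return " ".join(text.split())


print(normalize_height("29,029 feet"), normalize_height("8848 meters"))
print(normalize_name("Smith, John"), normalize_name("John Smith"))
```

After normalization the two height claims and the two name variants compare equal, so a truth discovery method counts them as support for the same candidate value.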

3.1.4 Input uncertainty 

When the observations are extracted from textual data (for example,
in the knowledge fusion task) or the sources provide observations
with their confidence indicators (for example, in the
question-answering system), it is necessary to consider the
uncertainty of these observations. In [47], the authors propose a way
to generalize truth discovery methods, which considers
multi-dimensional uncertainty, such as the uncertainty in information
extractors.


3.1.5 Structured vs. unstructured data



However, at the same time, this extra information introduces 
more noise and uncertainty. 


3.1.6 Streaming data 

In many real-world applications, data continue to arrive over time.
Most of the existing truth discovery methods are batch algorithms and
work on static data. These methods are inefficient at processing
streaming data as they need to re-run the batch algorithms when new
data are available. To tackle this challenge, some truth discovery
algorithms [32, 64, 79] have been designed for different types of
streaming data.


3.1.7 Labeled truths 

Besides the input data, truth discovery methods might assume some
additional labeled information. As labeled truths are usually
difficult to collect, most truth discovery methods are unsupervised,
i.e., estimating the truths without any labeled information. In
[21, 35, 75], however, the authors assume that a small set of truths
is available and thus the proposed algorithms work in semi-supervised
settings. In this way, a few available truths can be used to guide
source reliability estimation and truth computation. The results show
that even a small set of labeled truths could improve the
performance.


3.2 Source Reliability 

As source reliability estimation is the most important feature of
truth discovery, here we examine some assumptions and discuss several
advanced topics about the source reliability estimation.


3.2.1 Source consistency assumption 


Most truth discovery methods [15, 24, 30, 46, 74, 75, 77] have the
source consistency assumption, which can be described as follows: A
source is likely to provide true information with the same
probability for all the objects. This assumption is reasonable in
many applications. It is one of the most important assumptions for
the estimation of source reliability.


3.2.2 Source independence assumption 
Some truth discovery methods [24, 30, 46, 74, 77] have the source
independence assumption, which can be interpreted as the fact that
sources make their observations independently instead of copying from
each other. This assumption is equivalent to the following: The true
information is more likely to be identical or similar among different
sources, and the false information provided by different sources is
more likely to be different.


3.2.3 Source dependency analysis 
In [14, 16, 49, 50, 53], the authors relax the source independence
assumption. In these works, the authors try to detect copying
relationships among sources, and then adjust the sources’ weights
according to the detected relationships. This copying detection is
beneficial in application scenarios where some sources copy
information from others. The main principle behind copy detection is
that if some sources make many common mistakes, they are not likely
to be independent of each other. However, this principle becomes
ineffective when some sources copy information from a good source.
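
The intuition that shared mistakes hint at dependence can be illustrated with a simple count of identical wrong values per pair of sources. This is only a rough heuristic written for illustration; it is not the Bayesian analysis used in the cited works, and the function name and toy data are hypothetical.

```python
from itertools import combinations


def shared_mistakes(claims, truths):
    """Count, for each pair of sources, the objects on which both
    sources give the same value and that value differs from the truth.

    claims: dict mapping source -> dict mapping object -> value.
    truths: dict mapping object -> currently identified truth.
    """
    counts = {}
    for a, b in combinations(sorted(claims), 2):
        common = set(claims[a]) & set(claims[b])
        counts[(a, b)] = sum(
            1
            for o in common
            if o in truths
            and claims[a][o] == claims[b][o]
            and claims[a][o] != truths[o]
        )
    return counts


# Hypothetical data: A and B repeat the same wrong values for "x" and "y".
toy = {
    "A": {"x": 1, "y": 5, "z": 9},
    "B": {"x": 1, "y": 5, "z": 8},
    "C": {"x": 2, "y": 4, "z": 9},
}
current_truths = {"x": 2, "y": 4, "z": 9}
print(shared_mistakes(toy, current_truths))
```

A high count for a pair suggests that the two sources may not be independent. As the text notes, the signal disappears when the copied source is accurate, because agreeing on correct values is also what independent reliable sources do.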

In [15], the authors propose a method to detect direct copying
relationships. They assume that a copier does not copy all the
information from other sources and may provide some information by
itself. In other words, the detected copying relationship is
represented by a probability that a source copies the provided
information from others. The authors apply Bayesian analysis to infer
the existence and direction of copying relationships among sources.
This copying detection procedure is tightly combined with truth
discovery, and thus the detected copying relationships and the
discovered truths are iteratively updated. On the other hand, [50]
alleviates the source dependency problem by revealing the latent
group structure among sources, and aggregates the information at the
group level. This can reduce the risk of overusing the information
from the dependent sources, especially when these sources are
unreliable.

Compared with the above work, in which the copying relationship is
detected from snapshots of data, [16] detects copying relationships
from dynamic data, where the information provided by sources is
changing. The authors apply a Hidden Markov Model (HMM) to detect
copying relationships given the update history of sources, and the
outputs are the evolving copying relationships and the evolving true
values. To evaluate the quality of sources, instead of simply
considering their accuracy, more metrics are adopted for the dynamic
data, such as coverage (how many values in the history a source
covers), exactness (how many updates conform to the reality), and
freshness (how quickly a source captures a new value). The copying
relationship detection is performed by considering the source quality
and the update behaviors among sources simultaneously. Similar to
[15], both the copying relationships and true values are inferred in
an iterative procedure.

However, there may be other complex copying relationships rather than
direct copying, such as co-copying (i.e., multiple sources copy from
one source), and transitive copying (i.e., a source may copy from
other sources transitively). To detect such global copying
relationships, [14] extends the ideas of [15] and [16]: By
considering both the completeness and accuracy of sources, the
existence and direction of direct copying relationships are detected.
Then based on these detected direct copying relationships, Bayesian
analysis is performed to detect global copying relationships.

The main principle behind copy detection is that copiers make the
same mistakes as the sources they copy from, which is limited in
characterizing the relations among sources. In [49], the authors
consider different correlations between sources which are more
general than copying. Sources may have positive correlations, such as
copying information from each other or sharing similar rules for
information extractors; sources may also have negative correlations,
such as providing data from complementary domains or focusing on
different fields of information. Intuitively, the observations from
positively correlated sources should not increase the belief that
they are true, and the observations supported by only a few
negatively correlated sources should not decrease the belief that
they are true. The proposed method models correlations among sources
with joint precision and joint recall, and Bayesian analysis is
performed to infer the probability of an observation being true.
Approximation methods are also provided as the computational
complexity of estimating joint precision and recall is exponential.

In the above work, source correlations are inferred from data. In
practice, however, such source correlation information might be
available as extra input. For example, the authors in [65] fuse the
observations from multiple Twitter users by estimating the
reliability of each individual user (source). In this application
scenario, Twitter users may report observations made by others as
their own, which is a form of source copying. Instead of estimating
the correlations among users, the follower-followee graph and
retweeting behaviors in Twitter capture the correlations among users,
and such extra information can be adopted to improve the results of
truth discovery algorithms.


3.2.4 Fine-grained source reliability 

In some scenarios, it is reasonable to estimate multiple source
reliabilities for a single source, and thus the variety in source
reliability can be captured. The authors in [25] demonstrate that
when objects can be clustered into sets, it is better to estimate a
source reliability degree for each object set. Similarly, as a
website may have very different reliabilities for different
attributes of objects, the authors in [75] treat a website as
multiple sources according to attributes and objects, instead of
treating the website as a single source. This is equivalent to
assigning multiple source reliability degrees to a single website. In
[37], the authors propose a probabilistic graphical model to jointly
learn the latent topics of questions (objects), the fine-grained
source reliability, and the answers to questions (truths).
Furthermore, as the number of provided observations becomes very
large, the source consistency assumption may not hold any more, and
it is more reasonable to have multiple source reliabilities per
source.
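
The idea of keeping one reliability degree per source and object group, rather than one per source, can be sketched as below. The sketch assumes that group labels for objects are given and that agreement with the currently identified truths is an acceptable stand-in for reliability; both are simplifying assumptions of ours, as are the names used.

```python
from collections import defaultdict


def fine_grained_weights(observations, truths, group_of):
    """Estimate one weight per (source, object group).

    observations: list of (object, source, value) tuples.
    truths: dict mapping object -> currently identified truth.
    group_of: dict mapping object -> group label (assumed to be given).
    """
    hits, totals = defaultdict(int), defaultdict(int)
    for obj, src, val in observations:
        key = (src, group_of[obj])
        totals[key] += 1
        hits[key] += int(truths[obj] == val)
    return {key: hits[key] / totals[key] for key in totals}


# Hypothetical question-answering data with two topics.
group_of = {"q1": "sports", "q2": "sports", "q3": "history"}
truths = {"q1": "yes", "q2": "yes", "q3": "no"}
observations = [("q1", "A", "yes"), ("q2", "A", "no"), ("q3", "A", "no")]
print(fine_grained_weights(observations, truths, group_of))
```

In this toy case the same source is fully reliable on one topic and only half reliable on the other, a distinction that a single scalar weight would average away.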


also capture the probability that this source fails to provide truths
(recall). Thus, in the proposed method LTM, both precision and recall
are adopted to estimate the source reliability. On the other hand, it
is possible that there is no truth for some objects, e.g., the death
date of someone who is still alive. Under this scenario, the authors
in [80] propose several new metrics, such as silent rate, false
spoken rate and true spoken rate, to describe the source quality.

When some sources only provide one or two pieces of information,
truth discovery methods may not be able to accurately estimate
reliability degrees for these sources. However, such sources are
commonly observed and they may contribute to the main part of the
collected data. To better deal with such “small” sources, in [29],
the authors propose a confidence-aware truth discovery approach that
outputs the confidence interval of the source reliability estimation.

In [45], the authors assume that the truths are subjective rather
than objective. That is, the identified truths are dependent on the
end-users: For different users, the truths can be different. To fit
this scenario, instead of representing the reliability of a source as
a single scalar, they represent it as three interrelated, but
separate values: truthfulness, completeness, and bias. Specifically,
truthfulness reflects the probability of a source asserting the
truth, completeness reflects the proportion of objects which are
covered by a source, and bias is the extent to which a source tends
to support a favored position. These three metrics are calculated
according to the importance of claims with respect to a specific
user, so that they can better reflect users’ prior knowledge and
preferences.

In 52], the authors study the problem of selecting a subset 
of sources in dynamic scenarios for truth discovery. To bet¬ 
ter capture the time-dependent source quality, the y d efine a 
set of metrics, including accuracy (truthfulness in 45 ), cov¬ 


erage (completeness in [45]), and freshness (the frequency of 
updates provided by a source). 

In the probabilistic model of 48], the source reliability has 
different semantic meanings in different settings: It can in¬ 
dicate the probability of a source asserting truth, or the 
probability of a source both knowing and telling the truth, 
or even the probability of a source intending to tell the truth. 
The model selection depends on the specific application sce- 


3.2.5 Enriched meaning of source reliability 
In most truth discovery work, source reliability is a param¬ 
eter that is positively correlated with the probability of a 
so urce asserting truths. However in work 29,45,48,52 [67 


78||80 , the meaning of this parameter is further enriched to 


fit more complex application scenarios. 

I 11 many applications of crowd/social sensing, people want 
to infer an observation is true or false, for example, whether 
or not a specific gas station runs out of gas. Thus in 67 , the 


authors model the observations from participants as binary 
variables, and capture the source (participant) reliability by 
estimating both true positive rate and false positive rate. 

In LTM [78], the authors assume that the true information 
about an object can contains more than one value, for exam¬ 
ple, the authors of a book can be more than one. To estimate 
the quality of a source, it is insufficient to only calculate the 
probability that this source’s provided information is accu¬ 
rate (precision). Furthermore, the source reliability should 


3.2.6 Source reliability initialization 
Most truth discovery methods start with uniform weights
among all sources. As a result, the performance of truth
discovery may rely on the majority. When the majority of
sources are good, this strategy works well. However, this
is usually not the case in real scenarios, as sources may copy
information from others or they may provide out-of-date
information. Nowadays, people apply truth discovery to
challenging tasks, such as information extraction and knowledge
graph construction. In these challenging tasks, most of the
sources are unreliable. For example, [76] reports that in their
task, “62% of the true responses are produced only by 1 or
2 of the 18 systems (sources)”. This example reveals that a
better initialization for source reliability is much in demand.
Recent work adopts a subset of labeled data [21, 35, 75], an
external trustful information source [17], or the similarity
among sources [76] as prior knowledge to initialize (or help
to initialize) the source reliability. 
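
As a minimal sketch of the first option (initialization from a subset of labeled data), the function below estimates a starting weight per source from the objects whose truths are known. It is not taken from any of the cited papers: the function name `init_source_weights`, the `(source, object, value)` triple layout, and the `default` and `smoothing` parameters are assumptions made for illustration.

```python
from collections import defaultdict


def init_source_weights(claims, labeled_truths, default=0.5, smoothing=1.0):
    """Estimate an initial reliability per source from a small labeled subset.

    claims: iterable of (source, object, value) triples.
    labeled_truths: dict mapping some objects to their known true value.
    Sources without any labeled claim fall back to `default`.
    """
    correct = defaultdict(float)
    total = defaultdict(float)
    sources = set()
    for source, obj, value in claims:
        sources.add(source)
        if obj in labeled_truths:
            total[source] += 1.0
            if value == labeled_truths[obj]:
                correct[source] += 1.0
    weights = {}
    for source in sources:
        # Smoothing pulls estimates based on few labels toward `default`.
        weights[source] = (correct[source] + smoothing * default) / (
            total[source] + smoothing
        )
    return weights
```

The resulting weights could replace the uniform start of an iterative method, so that the first round of aggregation no longer reduces to plain voting.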

3.2.7 Source selection 

We often expect better performance of truth discovery when
more data sources are involved. However, nothing comes
for free in practice: both economic and computational costs
should be taken into consideration when applying truth discovery.
Moreover, [21] shows that the incorporation of bad
sources may even hurt the performance of truth discovery.
In order to solve these problems, [21, 52, 53] provide methods
to wisely select sources for truth discovery constrained
by the cost and output quality. 


In [53], the authors propose a method to select sources under
different cost constraints, in which the task is formulated
as an optimization problem. Their Integrating Dependent
Sources (IDS) system can return an optimal subset
of sources to query, and the selection guarantees that the 
cost constraints are satisfied. The greedy and randomized 
approximation algorithms proposed for this problem run in 
polynomial time with provable quality guarantees. 

In [21], source selection is performed on static data sources. 
The authors first provide a dynamic programming algorithm 
to infer the accuracy of integrating any arbitrary subset of 
sources. Then a randomized greedy algorithm is adopted 
to select the optimal subset of sources by incorporating a 
source in integration only if the gain of accuracy exceeds 
the corresponding cost. In [52], by contrast, source selection is
performed on dynamic data sources whose contents are updated
over time. A greedy algorithm is developed to output
near-optimal source selection and the optimal frequencies to
acquire sources for up-to-date information.
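
The rule of adding a source only when the accuracy gain exceeds its cost can be illustrated with a simplified sketch. The setting below is an assumption chosen to keep the example short and is not the algorithm of [21]: objects are binary, sources are independent with known accuracies, fusion is plain majority voting with ties broken by a fair coin, and the greedy loop is deterministic rather than randomized. The names `majority_vote_accuracy`, `greedy_select` and `value_per_accuracy` are hypothetical.

```python
def majority_vote_accuracy(accuracies):
    """Probability that a majority vote over independent binary sources is right.

    dist[k] is the probability that exactly k of the sources seen so far are
    correct; it is updated one source at a time (dynamic programming).
    """
    dist = [1.0]
    for p in accuracies:
        nxt = [0.0] * (len(dist) + 1)
        for k, prob in enumerate(dist):
            nxt[k] += prob * (1.0 - p)
            nxt[k + 1] += prob * p
        dist = nxt
    n = len(accuracies)
    total = 0.0
    for k, prob in enumerate(dist):
        if 2 * k > n:
            total += prob
        elif 2 * k == n:
            total += 0.5 * prob  # tie: assume a fair coin flip
    return total


def greedy_select(sources, value_per_accuracy=1.0):
    """sources: dict name -> (accuracy, cost). Returns chosen names in order."""
    chosen = []
    current = majority_vote_accuracy([])  # 0.5 when no source is used
    remaining = dict(sources)
    while remaining:
        best_name, best_profit, best_acc = None, 0.0, current
        for name, (acc, cost) in remaining.items():
            fused = majority_vote_accuracy(
                [sources[c][0] for c in chosen] + [acc]
            )
            profit = value_per_accuracy * (fused - current) - cost
            if profit > best_profit:
                best_name, best_profit, best_acc = name, profit, fused
        if best_name is None:
            break  # no remaining source pays for itself
        chosen.append(best_name)
        current = best_acc
        del remaining[best_name]
    return chosen
```

Under these assumptions, adding a weaker source to a single strong one can lower the fused accuracy, which is consistent with the observation that bad sources may hurt performance.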

More interestingly, [28] claims that a source is useless only 
when it guesses answers randomly. Even very bad sources 
can make positive contributions if the truth discovery approach
assigns negative weights to them. In other words, if a bad 
source provides a piece of information, we can infer that this 
information might be wrong with high probability. 
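
One common way to obtain such negative weights for binary claims is a log-odds weighting, sketched below. This is an illustrative assumption and not necessarily the formulation used in [28]; the function names and the requirement that accuracies lie strictly between 0 and 1 are ours.

```python
import math


def log_odds_weight(accuracy):
    """Weight of a binary source: zero at 0.5, negative below it."""
    return math.log(accuracy / (1.0 - accuracy))


def weighted_binary_vote(votes, accuracies):
    """votes: dict source -> True/False; accuracies: dict source -> (0, 1).

    Returns True when the weighted evidence favors the claim being true.
    """
    score = 0.0
    for source, vote in votes.items():
        sign = 1.0 if vote else -1.0
        score += sign * log_odds_weight(accuracies[source])
    return score > 0.0
```

With this weighting, a source that guesses randomly (accuracy 0.5) has weight zero, while a source with accuracy 0.1 counts strongly against whatever it asserts.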


3.3 Object 

In this section, we describe how the object difficulty and the 
relations among objects affect truth discovery. 


3.3.1 Object difficulty 

In [24], the 3-Estimates algorithm is proposed, which estimates
not only the truth and source reliability, but also the dif¬ 
ficulty of getting the truth for each object. This difficulty 
factor is captured by introducing the trustworthiness of each 
claimed value. In other words, if a source makes an error 
on an object, the penalty will be distributed to both the 
“source reliability” factor and the “object difficulty” factor. 
Thus by considering the object difficulty, the errors intro¬ 
duced by objects and sources are separated, and the source 
reliability can be better estimated. 


3.3.2 Relations among objects 

Most of the truth discovery methods assume that objects 
are independent. In practice, objects may have relations
and could affect each other. For example, “the birth year
of a person” has a strong relation with “the age of a person”,
and “A is the father of B” indicates that “B is the son of 
A”. Such prior knowledge or common sense about object 
relations could improve the results of truth discovery. 

In [46], prior knowledge is translated into propositional constraints
that are integrated into each round of the truth discovery
process. Specifically, each fact (an object and its corresponding
value) is represented as a [0,1] variable, and then
according to the prior knowledge, related facts are combined 
into propositional constraints. A cost function is defined as 
the difference between the original results solely based on 
truth discovery and new results which satisfy the proposi¬ 
tional constraints. By minimizing the cost, the probabil¬ 
ity of each fact to be true is “corrected” according to prior 
knowledge during each iteration. To guarantee the feasibil¬ 
ity of this optimization problem, for each object, an aug¬ 
mented “unknown” answer is incorporated to relax the con¬ 
straints and avoid the possible conflicts among constraints. 

In [76], information extraction and truth discovery methods
are combined to solve the Slot Filling Validation task, which
aims to determine the credibility of the output informa¬ 
tion extracted by different systems from different corpora. 
The authors construct a Multi-dimensional Truth-Finding 
Model (MTM) which is a heterogeneous network including 
systems, corpora, extracted information, and weight matri¬ 
ces between them. Similar to the iterative truth discovery 
process, the credibility of the extracted information is prop¬ 
agated within the network to infer the reliabilities of both 
systems and corpora, and in turn, the reliabilities of sys¬ 
tems and corpora are used to refine the credibility of the 
extracted information. To initialize the credibility, the au¬ 
thors consider the dependent relationship among different 
slots (objects) by constructing a knowledge graph. For example,
“Adam is a child of Mary” should have high credibility
if we already believe that “Bob is a child of Mary” and
“Adam is a sibling of Bob”. 

Temporal and spatial relations also exist among objects. For 
example, today’s high temperature for New York City has 
correlation with the one of yesterday, and the nearby seg¬ 
ments of roads may have correlated traffic conditions. Such 
temporal and spatial relations among objects can be captured
to benefit the truth discovery procedure. Recently, [69]
and [32] model the temporal relations among evolving objects
on categorical data and continuous data respectively,
while [68] and [39] can handle both temporal and spatial
relations on categorical data and continuous data respectively.
Further, [39] demonstrates that capturing the relations
among objects can greatly improve the performance of
truth discovery as objects may receive insufficient observa¬ 
tions in many real-world applications. 

3.4 Claimed value 

In this section, we discuss some issues about the claimed 
values to be considered in the development of truth discovery 
approaches. 

3.4.1 Complementary vote 

The complementary vote technique can be used to infer ex¬ 
tra information from the claimed values. This technique is 
based on the single truth assumption that “there is one and 
only one true value for each object”. If a source provides a 
value about an object, [24, 78] assume that this source votes
against other candidate values of this object. This complementary
vote is also related to the Local Closed World
Assumption (LCWA) EH , in which any candidate values 
that violate external knowledge cannot be the truth. The 
difference between them is that, for LCWA, the trustful ex¬ 
ternal knowledge is used to reject some candidate values of 
the same object, while for complementary vote, one candi¬ 
date value is used to reject other candidate values. 
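
The complementary vote can be sketched as follows, under the single truth assumption. The scoring rule used here (add the source weight to the claimed value, subtract it from every other candidate of the same object) is a simplified illustration and not the exact update of [24] or [78]; the function name and data layout are assumptions.

```python
from collections import defaultdict


def complementary_vote_scores(claims, weights):
    """claims: iterable of (source, object, value); weights: source -> float.

    A source adds its weight to the value it claims and subtracts the same
    weight from every other candidate value of that object.
    """
    claims = list(claims)
    candidates = defaultdict(set)
    for _, obj, value in claims:
        candidates[obj].add(value)
    scores = defaultdict(float)
    for source, obj, value in claims:
        for candidate in candidates[obj]:
            if candidate == value:
                scores[(obj, candidate)] += weights[source]
            else:
                scores[(obj, candidate)] -= weights[source]
    return dict(scores)
```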

3.4.2 Implication of values 

In [74], the concept of implication between claimed values 
is proposed. For the same object, different claimed values 
about this object are not independent, instead, they are 
correlated with each other. Let’s consider “the height of a 
patient” as an object, and various sources provide three dif¬ 
ferent claimed values: 175cm, 176cm, and 185cm. If 175cm 
is trustworthy, it implies that 176cm is also trustworthy with 
a high probability, while the chance of 185cm being trust¬ 
worthy is lower. This implication factor between claimed 
values is also considered in [15].
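
The height example can be made concrete with a small sketch. The implication function below (an exponential decay in the distance between two values, shifted by a base so that distant values conflict) and the parameters `scale`, `base` and `rho` are hypothetical choices for illustration; [74] leaves the implication function domain-specific.

```python
import math


def implication(a, b, scale=2.0, base=0.5):
    """Hypothetical implication of value a on value b.

    Close values support each other (positive result); distant values
    conflict (negative result). `scale` and `base` are illustrative.
    """
    return math.exp(-abs(a - b) / scale) - base


def adjust_confidence(confidence, rho=0.5):
    """confidence: dict value -> score. Adds the influence of other values."""
    adjusted = {}
    for v, score in confidence.items():
        influence = sum(
            confidence[u] * implication(u, v) for u in confidence if u != v
        )
        adjusted[v] = score + rho * influence
    return adjusted
```

Starting from equal confidence for 175, 176 and 185, the adjusted confidence of 185 ends up lowest, since it receives negative implication from both of the other values.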


3.4.3 Data type of claimed values 

Actually, the aforementioned implication functions capture 
the distance of continuous values. In some sense, the complementary
vote [24, 78] is based on the distance of categorical
values. In fact, it is essential to note that different
data types have different concepts of “distance”. Truth discovery
methods, such as [15, 24, 46, 74], focus on the categorical
data type, though [15, 74] can be extended to continuous or
string data by adding implication functions. On the other
hand, GTM [77] is especially designed for the continuous data
type. From another perspective, [75] uses different relationships
among observations to capture the properties of different
data types: for the categorical data type, mutually exclusive
relations exist among observations, and for continuous data 
type, mutual supportive relations are modeled. Unfortu¬ 
nately, these relations are slightly limited and usually not 
easy to establish in practice. Recently, [30] proposes a general
truth discovery framework for heterogeneous data, in
which the unique properties of data types are captured by 
appropriate distance functions. 
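
To illustrate how a distance function determines the aggregation rule, the sketch below pairs a 0-1 loss with weighted voting for categorical data and a squared loss with the weighted mean for continuous data. It is a simplified illustration in the spirit of such a framework, not the formulation of [30]; all function names are assumptions.

```python
from collections import defaultdict


def zero_one_loss(truth, value):
    """Distance for categorical values: 0 if equal, 1 otherwise."""
    return 0.0 if truth == value else 1.0


def squared_loss(truth, value):
    """Distance for continuous values."""
    return (truth - value) ** 2


def weighted_loss(truth, observations, weights, loss):
    """Total weighted distance between a candidate truth and all claims."""
    return sum(weights[s] * loss(truth, v) for s, v in observations.items())


def aggregate(observations, weights, data_type):
    """observations: dict source -> claimed value for one object.

    Returns the value minimizing the weighted loss: a weighted vote for
    categorical data (0-1 loss), a weighted mean for continuous data
    (squared loss).
    """
    if data_type == "categorical":
        votes = defaultdict(float)
        for source, value in observations.items():
            votes[value] += weights[source]
        return max(votes, key=votes.get)
    if data_type == "continuous":
        total = sum(weights[s] for s in observations)
        return sum(weights[s] * v for s, v in observations.items()) / total
    raise ValueError("unknown data type: " + data_type)
```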


3.4.4 Ambiguity of data type 

It is interesting to note that data types may be ambiguous 
sometimes. For example, “2” can be considered as a con¬ 
tinuous number naturally, but under some circumstances, it 
should be considered as categorical data, such as the class 
label of an object. If it appears in an address, then it should 
be considered as a string. This example illustrates that the 
type of data depends on the specific application, and the 
assumption about data type should be examined before ap¬ 
plying a truth discovery approach. 


3.4.5 Hierarchical structure of claimed values 
As mentioned above, both complementary vote and mutual 
exclusivity capture the distance of categorical or textual values,
but they do not fully explore the distance. In [17, 18],
the authors argue that the claimed values can be represented
in a hierarchical value space. For example, Barack Obama’s 
birthplace is Honolulu, but it is also true that his birth¬ 
place is Hawaii, or even the USA. In this case, the claimed 
values are not mutually exclusive, and the complementary 
vote cannot be applied. It is also different from the multiple
truths scenarios, as these true values are linked with each 
other through a hierarchical structure. 


3.5 Output 

In this section, we discuss the following factors that need 
to be taken into consideration for the outputs of truth discovery:
(1) What kind of assumptions should we adopt for
the identified truths? (2) Which format of output is better,
labeling or scoring? (3) How to evaluate the performance of
truth discovery methods? (4) How to interpret the outputs 
of truth discovery? 


3.5.1 Single truth vs. multiple truths
Most of the truth discovery methods [15, 24, 30, 46, 74, 77] hold
the “single truth” assumption which assumes there is one
and only one truth for each object. With this assumption,
the truth discovery task aims to select the most trustworthy
information as truths. In [15, 24], this assumption is made
to be more explicit and stronger: If a source provides a 
claimed value for an object, this source is assumed to vote 
against other possible claimed values for this object. 

The “single truth” assumption holds in many application 
scenarios; however, it is not always true. For example,
a book may have multiple authors and a movie multiple
actors/actresses. In [78], a probabilistic graphical
model LTM is proposed to discover multiple truths for each 
object. Under such scenarios, only considering source ac¬ 
curacy cannot distinguish the difference between a source 
with low precision and a source with low recall. However, 
both precision and recall are essential for discovering multi¬ 
ple truths. Thus LTM considers both false positive and false 
negative claims, and it can discover multiple truths simultaneously.
Similarly, in [49], the authors also calculate both
the source precision and recall to discover multiple truths. 
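
The difference between a low-precision and a low-recall source can be seen in a small sketch that scores one source against known multi-valued truths. The function name and the dict-of-sets layout are assumptions for illustration; LTM itself infers these quantities jointly with the truths rather than from given labels.

```python
def source_precision_recall(claimed, truths):
    """claimed: dict object -> set of values asserted by one source.
    truths: dict object -> set of true values.
    Objects absent from `truths` are ignored.
    """
    tp = fp = fn = 0
    for obj, true_values in truths.items():
        asserted = claimed.get(obj, set())
        tp += len(asserted & true_values)   # true values the source gave
        fp += len(asserted - true_values)   # false positive claims
        fn += len(true_values - asserted)   # false negatives (missed truths)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```

A cautious source that lists only one of three true authors has perfect precision but low recall, while a source that lists all three plus three wrong names has perfect recall but a precision of one half.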


3.5.2 “Unknown” truths 

In [46], for each object, an extra candidate output “un¬ 
known” is considered. This technique is motivated by the 
following fact: When the relations among objects are repre¬ 
sented as constraints, these constraints may contradict each 
other. Thus it is possible that no solution (truths for all the 
objects) can satisfy all the constraints. By introducing the 
output choice of “unknown”, at least one feasible solution, 
“unknown” for all the objects, is guaranteed. More importantly,
for the objects that do not have a truth, the choice
of “unknown” can be a correct output. For example, it is 
suitable to give an output of “unknown” for the questions 
(objects) about the death date of someone who is still alive. 
The authors in [80] propose a probabilistic model to take
into account the truth existence, and the proposed method 
can assign the output of “unknown” to the questions that 
have no truths. 


3.5.3 Labeling vs. scoring

Typically, the outputs of truth discovery methods fall into 
one of the following two categories: labeling and scoring. 
The labeling technique assigns a label (true or false) to each 
claimed value, or assigns a truth to each object. Note here
that the truths given by truth discovery methods, such as
GTM [77] and the framework in [46], are not necessarily
observed or claimed by any sources. The benefit of labeling 
technique is that the results are ready to use and can be 
directly fed into other applications, while its drawbacks are 
that some useful information is lost during the iterations 
and it does not provide confidence scores for the identified 
truths. 

This motivates the need for the scoring technique, which 
assigns a score to each claimed value, usually in the form 
of probability. Thus a post-processing step is needed to
choose the truths by using a cut-off threshold or choosing
the top ones for each object. The benefit of the scoring
technique is that all the claimed values have a certain probability
of being chosen as truths and there is less information
loss during the iterative procedure. 
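
The two post-processing options mentioned above, a cut-off threshold and a top-k choice, can be sketched as follows. The function names, the nested dict layout and the default values are assumptions for illustration.

```python
def truths_by_threshold(scores, threshold=0.5):
    """scores: dict object -> dict value -> probability-like score.

    Keeps every value whose score reaches the threshold, so an object may
    end up with zero, one, or several truths.
    """
    return {
        obj: {v for v, s in cand.items() if s >= threshold}
        for obj, cand in scores.items()
    }


def truths_by_top_k(scores, k=1):
    """Keeps the k highest-scoring values per object, best first."""
    return {
        obj: sorted(cand, key=cand.get, reverse=True)[:k]
        for obj, cand in scores.items()
    }
```

The threshold variant suits multi-truth or "unknown" settings, whereas top-1 corresponds to the single truth assumption.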


3.5.4 Performance measurement 

To evaluate the effectiveness of truth discovery methods,
various performance metrics are adopted: accuracy (or error
rate) for categorical data, and Mean Absolute Error (MAE)
and Root Mean Square Error (RMSE) for continuous
data. These metrics are suitable when groundtruths are
available. In [75], the “usefulness” of results is proposed:
not all the results are worth being evaluated; instead, we
only care about the useful identified truths. Besides, memory
cost and running time can be used to evaluate the efficiency.
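
For reference, the three effectiveness metrics can be computed as in the sketch below, assuming the inferred truths and the groundtruths are given as dicts keyed by object and that every groundtruth object has a prediction. The function names are ours.

```python
import math


def error_rate(predicted, truth):
    """Fraction of objects whose inferred categorical truth is wrong."""
    wrong = sum(1 for obj in truth if predicted[obj] != truth[obj])
    return wrong / len(truth)


def mae(predicted, truth):
    """Mean Absolute Error over continuous objects."""
    return sum(abs(predicted[o] - truth[o]) for o in truth) / len(truth)


def rmse(predicted, truth):
    """Root Mean Square Error; penalizes large errors more than MAE."""
    return math.sqrt(
        sum((predicted[o] - truth[o]) ** 2 for o in truth) / len(truth)
    )
```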


3.5.5 Output explanation 

The interpretations of truth discovery results are important 
for both end-users (“why should I trust the results”) and 
developers (“how to improve the results”). However, most 
of the truth discovery methods output opaque information. 
For example, the output could be that the reliability degree 
of a source is 6, or the support from sources for a claimed 
value is 3.7. These intermediate parameters are not inter¬ 
pretable for end-users and developers, and we need to con¬ 
struct some intuitive explanations based on the raw outputs. 


In [48], the source reliability degree is modeled as the probability
of a source asserting the truth, which is more meaningful
for human interpretation. In addition, they extend
the source reliability definition by considering the difference 
between knowing the truths and telling the truths. Specifi¬ 
cally, in their SimpleLCA model, the reliability is considered 
as the probability of a source asserting the truths, while in 
GuessLCA model the reliability is the probability of a source 
both knowing and asserting the truths, and in MistakeLCA 
and LieLCA models, the reliability means the probability of 
a source intending to tell the truths. 


In [22], the authors propose methods to automatically generate
both snapshot explanation and comprehensive explanation
for truth discovery. These explanations could be used to
explain the results of truth discovery derived from iterative 
computing procedure. Snapshot explanation provides end- 
users a high-level understanding of the results by gathering 
all the positive and negative explanations for a specific de¬ 
cision over alternative choices. It lists all the evidences that 
support the decision as positive explanations and all the 
evidences which are against the decision as negative ones. 
In order to make the explanation more concise, categoriza¬ 
tion and aggregation are conducted so that similar evidences 
will not appear repeatedly. In addition, they perform evi¬ 
dence list shortening to remove evidence with less impor¬ 
tance, and thus human-interpretable explanations can be 
derived. Compared with snapshot explanation, comprehen¬ 
sive explanation constructs a directed acyclic graph (DAG) 
to provide an in-depth explanation of the results. The DAG 
is constructed by tracing the decisions from the last round 
of iteration all the way to the first iteration where each node 
represents a decision and the children represent evidences. 
After the DAG construction, an explanation shortening is 
performed to eliminate those rounds where no decisions are 
reached to derive a concise explanation. 


4. TRUTH DISCOVERY METHODS 

In the previous section, we summarize various truth discov¬ 
ery methods from five aspects, namely, input data, source 
reliability, object, claimed value, and the output. Now, we 
briefly describe several representative truth discovery methods,
and compare them under different features (Tables 2
and 3). By providing such a comparison, we hope to give
some guidelines so that users and developers can choose an 
appropriate truth discovery method and apply it to the spe¬ 
cific application scenarios. 

Due to space limitation, here we only describe each truth 
discovery algorithm briefly. For more details about these 
methods, the readers may refer to the reference papers. 


• TruthFinder [74]: In TruthFinder, Bayesian analysis is
adopted to iteratively estimate source reliabilities and 
identify truths. The authors also propose the source 
consistency assumption and the concept of “implica¬ 
tion” , which are widely adopted in other truth discov¬ 
ery methods. 


• AccuSim [15, 31]: AccuSim also applies Bayesian analysis.
In order to capture the similarity of claimed values,
the implication function is adopted.


• AccuCopy [15, 31]: This method improves AccuSim,
and considers the copying relations among sources. 
The proposed method reduces the weight of a source 
if it is detected as a copier of other sources. 


• 2-Estimates [24]: In this approach, the single truth as¬ 
sumption is explored. By assuming that “there is one 
and only one true value for each object”, this approach 
adopts complementary vote. 


• 3-Estimates [24]: 3-Estimates augments 2-Estimates 
by considering the difficulty of getting the truth for 
each object. 


• Investment [46]: In this approach, a source uniformly
“invests” its reliability among its claimed values, and 
the confidence of a claimed value grows according to 
a non-linear function defined on the sum of invested 
reliabilities from its providers. Then the sources col¬ 
lect credits back from the confidence of their claimed 
values. 


• SSTF [75]: In this semi-supervised truth discovery
approach, a small set of labeled truths is incorporated
to guide the source reliability estimation. Meanwhile,
both mutual exclusivity and mutual support are
adopted to capture the relations among claimed values.


• LTM [78]: LTM is a probabilistic graphical model
which considers two types of errors under the scenar¬ 
ios of multiple truths: false positive and false negative. 
This enables LTM to break source reliability into two 
parameters, one for false positive error and the other 
for false negative error. 

• GTM [77]: GTM is a Bayesian probabilistic approach
especially designed for solving truth discovery prob¬ 
lems on continuous data. 
Table 3: Comparison of Truth Discovery Methods: Part 2


• Regular EM [67]: Regular EM is proposed for
crowd/social sensing applications, in which the obser¬ 
vations provided by humans can be modeled as binary 
variables. The truth discovery task is formulated as 
a maximum likelihood estimation problem, and solved 
by EM algorithm. 

• LCA [48]: In the LCA approach, source reliability is modeled
by a set of latent parameters, which can give more
informative source reliabilities to end-users. 

• Apollo-social [65]: Apollo-social fuses the information 
from users on social media platforms such as Twitter. 
In a social network, a claim made by a user can either be
originally published by that user or be re-tweeted from
other users. Apollo-social models this phenomenon 
as source dependencies and incorporates such depen¬ 
dency information into the truth discovery procedure. 

• CRH [30]: CRH is a framework that deals with the heterogeneity
of data. In this framework, different types
of distance functions can be plugged in to capture the
characteristics of different data types, and the estimation
of source reliability is jointly performed across all
the data types together. 

• CATD [29]: CATD is motivated by the phenomenon 
that many sources only provide very few observations. 
It is not reasonable to give a point estimator for source 
reliability. Thus in this confidence-aware truth discov¬ 
ery approach, the authors derive the confidence inter¬ 
val for the source reliability estimation. 
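
Among the methods listed above, the Investment update lends itself to a short sketch of one round: uniform investment, non-linear growth of the pooled investment, and proportional pay-back. The code follows the verbal description only; the growth exponent of 1.2, the function name, and the requirement that every source has at least one claim and positive trust are assumptions made for illustration.

```python
from collections import defaultdict


def investment_round(claims_by_source, trust, growth=1.2):
    """One round of an Investment-style update.

    claims_by_source: dict source -> non-empty list of claimed values.
    trust: dict source -> current (positive) reliability.
    Returns (belief, new_trust).
    """
    invested = defaultdict(float)
    share = {}
    for source, claims in claims_by_source.items():
        share[source] = trust[source] / len(claims)  # uniform investment
        for claim in claims:
            invested[claim] += share[source]
    # Confidence grows non-linearly with the pooled investment.
    belief = {claim: amount ** growth for claim, amount in invested.items()}
    new_trust = {}
    for source, claims in claims_by_source.items():
        # Each source collects belief in proportion to its own stake.
        new_trust[source] = sum(
            belief[c] * share[source] / invested[c] for c in claims
        )
    return belief, new_trust
```

Because the growth function is super-linear, claims backed by several sources return more than was invested, which rewards sources that agree with others.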

We compare these truth discovery approaches in Tables 2
and 3. Here we summarize them under different features as
follows: 

• Input data: (1) Most truth discovery methods can
handle categorical data. Although GTM and CATD 
are designed for continuous data, it is feasible to en¬ 
code the categorical data into probabilistic vectors [6]. 
(2) TruthFinder, AccuSim and AccuCopy can han¬ 
dle continuous data by incorporating the implication 
function, and SSTF captures the characteristic of continuous
data by mutual support. GTM, CRH and
CATD are particularly designed for continuous data. 
(3) Besides CRH that is proposed for heterogeneous 
data, SSTF can also deal with categorical and con¬ 
tinuous data simultaneously. (4) Among these truth 
discovery methods, only SSTF is proposed to work in 
semi-supervised setting. However, in practice, most 
truth discovery methods can be modified to take ad¬ 
vantage of labeled truths. One possible solution is to 
replace some aggregated results in each iteration by 
corresponding labeled truths, which can thus guide the 
source reliability estimation. 


• Source reliability: (1) There is a series of work about 
source dependency analysis p| fl6p9p3] , and here we 
only list the AccuCopy algorithm and Apollo-social 
due to space limitation. (2) LTM, Regular EM, LCA 
and CATD enrich the meaning of source reliability 
from different aspects. 


• Object: Among the discussed truth discovery methods,
3-Estimates is the only one that takes into account
the object difficulty. Investment and PooledInvestment
approaches can incorporate the relations
among objects. More efforts are needed to develop 
truth discovery methods that consider the advanced 
features about objects. 


• Claimed value: Several truth discovery approaches 
[15, 24, 31, 75] adopt the concept of complementary vote
to strengthen the belief in single truth assumption. 


• Output: LTM is proposed to handle the multiple 
truth scenarios. Investment and PooledInvestment are
augmented with “unknown” truths. Note that it is 
possible to modify some truth discovery methods, for 
example, LTM, to consider the unknown truths by 
adding one more candidate value “unknown” for each 
object. 
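
The modification suggested under input data in the list above (replacing aggregated results by labeled truths in each iteration) can be sketched on top of a generic iterative weighted vote. The loop below is a deliberately simple stand-in for a truth discovery method, not any of the algorithms compared here; the smoothed accuracy used as the source weight and all names are assumptions.

```python
from collections import defaultdict


def weighted_vote(claims, weights):
    """Pick, for each object, the value with the largest total weight."""
    votes = defaultdict(lambda: defaultdict(float))
    for source, obj, value in claims:
        votes[obj][value] += weights[source]
    return {obj: max(cand, key=cand.get) for obj, cand in votes.items()}


def discover_with_labels(claims, labeled_truths, iterations=10):
    """claims: list of (source, object, value); labeled_truths: object -> value."""
    sources = {s for s, _, _ in claims}
    weights = {s: 1.0 for s in sources}  # uniform start
    truths = {}
    for _ in range(iterations):
        truths = weighted_vote(claims, weights)
        truths.update(labeled_truths)  # clamp labeled objects to known truths
        hits = defaultdict(float)
        counts = defaultdict(float)
        for source, obj, value in claims:
            counts[source] += 1.0
            if truths[obj] == value:
                hits[source] += 1.0
        # Smoothed accuracy serves as the new source weight.
        weights = {s: (hits[s] + 1.0) / (counts[s] + 2.0) for s in sources}
    return truths, weights
```

With an empty `labeled_truths` the loop reduces to the unsupervised case, so the effect of a few labels on the unlabeled objects can be compared directly.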


scenarios reveal the importance of capturing the rela¬ 
tions among objects. In knowledge graph 17], objects 
can be related in various ways, such as “the birth year 
of a person” and “the age of a person”. Although some 
efforts 39 46][68] have been made to consider the ob¬ 
ject relations, this problem needs more explorations. 
A more difficult question is to automatically discover 
relations among objects. Due to the large scale of the 
involved objects, it is impossible to manually detect 
and encode such relations. 


• Source reliability initialization. Some potential 
disadvantages are observed on truth discovery with 
uniform weight initialization: (1) At the beginning of 
the iterative solutions, the behavior of truth discov¬ 
ery is identical to voting/averaging, and thus truth 
discovery randomly chooses outputs for tie cases. If 
the ways of breaking ties are different, the estimated 
source weights may be quite different. (2) The uniform 
weight initialization leads to the phenomenon that the 
aggregated results of truth discovery methods rely on 
the majority to a certain extent. Nowadays, truth dis¬ 
covery has been applied in more difficult tasks such as 
slot filling i76j, in which the majority of sources pro¬ 
vide inaccurate information. Thus the performance of 
truth discovery may suffer from the uniform weight 
initialization on these tasks. These disadvantages mo¬ 
tivate people to investigate new ways to initialize the 
source reliability. 

• Model selection. Given various truth discovery 
methods, how to select an appropriate one to apply 
to some specific tasks? Although we compare them 
from different aspects and give some guidelines in pre¬ 
vious section, it is still a difficult task. Is it possible to 
apply various truth discovery methods together, and 
then combine the outputs of various methods as final 
output? This ensemble approach might be a possible 
solution to tackle the model selection challenge. 


5. FUTURE DIRECTIONS 

Although various methods have been proposed, for truth 
discovery task, there are still many important problems to 
explore. Here we discuss some future directions of truth 
discovery. 


Unstructured data. For most of the truth discovery 
approaches, they assume that the inputs are available 
as structured data. Nowadays, more and more applica¬ 
tions of truth discovery are dealing with unstructured 
data such as text. In jl8 76 , the inaccurate informa¬ 
tion can come from both text corpora and information 
extractors, i.e., two layers of sources. If the object diffi¬ 
culty is also considered, the penalty of wrong informa¬ 
tion should be distributed to three factors, corpora, 
extractors, and objects. This brings new challenges 
to source weight estimation. Further, the extracted 
inputs from unstructured data are much more noisy 
and also bring new information (uncertainty [46| , evi¬ 
dence |76|, etc.) to be taken into consideration. 


• Theoretical analysis. Will the truth discovery 
methods promise convergence? If so, what is the rate 
of convergence and is it possible to bound the errors 
of converged results? These and many more questions 
need further exploration. Another interesting task is 
to explore the relations or even the equivalence among 
various truth discovery approaches. In Section [2j we 
show some evidences that different ways to capture the 
general principle are equivalent. We may better un¬ 
derstand the relations among different truth discovery 
methods by exploring the equivalence among them. 


• Efficiency. Most of the existing truth discovery meth¬ 
ods adopt iterative procedure to estimate source relia¬ 
bilities and compute truths. This might be inefficient 
in practice, especially when we apply truth discovery 
on large-scale datasets. Some attentions have been 
given to the efficiency issue in streaming data scenar- 
@[ 321164 , 
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However, more efforts are needed 
to improve the efficiency of truth discovery for general 


• Object relations. For most of the truth discovery 
approaches, they assume that objects are independent 
with each other. However, more and more application 


• Performance evaluation. In current research
on truth discovery, the groundtruth information
is assumed to be available for the purpose of
performance evaluation. Unfortunately, in practice, we
cannot make this assumption. For example, in the
task of knowledge graph construction [17], the number
of involved objects is huge, and it is impossible to
have groundtruths for all of them for performance
validation. It requires expensive human effort to get
even a small set of labeled groundtruths. How can we
evaluate the performance of various truth discovery
methods when the groundtruth information is missing?
This is a big challenge for truth discovery applications.
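
The iterative procedure mentioned under the efficiency and theoretical analysis items can be illustrated with a minimal sketch for continuous data. It is not the algorithm of any particular cited method: the function name `iterative_truth_discovery`, the squared-error loss, the negative-log weight update, and the stopping tolerance are all assumptions chosen for illustration. The loop alternates between computing truths as weighted means and re-estimating source weights from each source's error, and it reports how many passes over the data were needed.

```python
import math


def iterative_truth_discovery(claims, max_iter=100, tol=1e-9):
    """Hypothetical sketch. `claims` maps source -> {object: numeric claim}."""
    sources = list(claims)
    objects = sorted({obj for s in sources for obj in claims[s]})
    weights = {s: 1.0 for s in sources}  # assume equally reliable sources at first
    truths = {}
    iteration = 0
    for iteration in range(1, max_iter + 1):
        # Truth step: weighted mean of the claims made on each object.
        new_truths = {}
        for obj in objects:
            voters = [s for s in sources if obj in claims[s]]
            total_weight = sum(weights[s] for s in voters)
            new_truths[obj] = (
                sum(weights[s] * claims[s][obj] for s in voters) / total_weight
            )
        # Weight step (assumed rule): a source whose claims lie farther from
        # the current truths receives a smaller weight.
        losses = {
            s: sum((v - new_truths[obj]) ** 2 for obj, v in claims[s].items()) + 1e-12
            for s in sources
        }
        total_loss = sum(losses.values())
        weights = {
            s: max(-math.log(losses[s] / total_loss), 1e-6) for s in sources
        }
        # Stop when the estimated truths no longer move noticeably.
        change = max(
            (abs(new_truths[obj] - truths[obj]) for obj in objects if obj in truths),
            default=math.inf,
        )
        truths = new_truths
        if change < tol:
            break
    return truths, weights, iteration


claims = {
    "A": {"x": 10.0, "y": 20.0},
    "B": {"x": 10.2, "y": 19.8},
    "C": {"x": 15.0, "y": 30.0},  # a source that disagrees with the others
}
truths, weights, n_iter = iterative_truth_discovery(claims)
```

Each iteration touches every claim once, so the cost grows with both the number of claims and the number of iterations; this is why the number of passes matters on large-scale data.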


6. APPLICATIONS 

Nowadays, truth discovery methods have been successfully 
applied in many real-world applications. 

• Healthcare. In online health communities, people
post reviews about various drugs, and this
user-generated information is valuable for both patients and
physicians. However, the quality of such information
is a big issue to address. In [42], the authors adopt the
idea of truth discovery to automatically find reliable
users and identify trustworthy user-generated medical
statements.

• Crowd/social sensing. Thanks to the explosive
growth of online social networks, users can provide
observations about the physical world for various
crowd/social sensing tasks, such as reporting gas
shortages after a disaster, or real-time summarization
of an evolving event. For these participatory
crowd/social sensing tasks, users' information
may be unreliable. Recently, a series of approaches
[5, 26, 60, 66, 69] have adopted truth discovery to
improve the aggregation quality of such noisy sensing
data.


• Crowdsourcing aggregation. Crowdsourcing platforms
such as Amazon Mechanical Turk provide
a cost-efficient way to solicit labels from crowd workers.
However, workers' quality is quite diverse, which
brings the core task of inferring true labels from the
labeling efforts of multiple workers [6, 12, 28, 51, 55, 56, …].
These approaches focus on learning true labels or answers to
certain questions. The main difference between
crowdsourcing aggregation and truth discovery is that the
former is an active procedure (one can control what
and how much data is generated by workers) while
the latter is a passive procedure (one can only choose
from available data sources).


• Information extraction. For information extraction
tasks, such as slot filling [76] and entity profiling [27],
related data can be collected from various corpora and
multiple extractors can be applied to extract desired
information. The outputs of different extractors can
be conflicting, and thus truth discovery has been
incorporated to resolve these conflicts.


• Knowledge base. Several knowledge bases, such as
Google Knowledge Graph [3], Freebase [2] and YAGO
[2], have been constructed, but they are still far from
complete. Truth discovery is a good strategy to
automatically identify trustworthy information from the
Internet. Existing work [17, 18] has demonstrated the
advantages of truth discovery on this challenging task.
Meanwhile, the source reliabilities estimated by truth
discovery can be used to assess the quality of
webpages [19].
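
To make the label inference task from the crowdsourcing aggregation item concrete, the following is a minimal sketch of iterated weighted voting over categorical labels. It is a deliberately simple stand-in, not one of the cited crowdsourcing methods: the function name `aggregate_labels`, the agreement-based accuracy estimate, and the fixed number of rounds are assumptions made for illustration.

```python
from collections import defaultdict


def aggregate_labels(labels, n_rounds=10):
    """Hypothetical sketch. `labels` maps worker -> {item: label}."""
    # Equal weights in the first round make it plain majority voting.
    accuracy = {worker: 1.0 for worker in labels}
    inferred = {}
    for _ in range(n_rounds):
        # Vote step: each worker votes with weight equal to estimated accuracy.
        votes = defaultdict(lambda: defaultdict(float))
        for worker, answers in labels.items():
            for item, label in answers.items():
                votes[item][label] += accuracy[worker]
        # Ties are broken by label order so that the result is deterministic.
        inferred = {
            item: max(sorted(tally), key=tally.get) for item, tally in votes.items()
        }
        # Accuracy step: fraction of a worker's answers matching inferred labels.
        accuracy = {
            worker: (
                sum(inferred[item] == label for item, label in answers.items())
                / len(answers)
                if answers
                else 0.0
            )
            for worker, answers in labels.items()
        }
    return inferred, accuracy


labels = {
    "w1": {"q1": "A", "q2": "B", "q3": "A"},
    "w2": {"q1": "A", "q2": "B", "q3": "B"},
    "w3": {"q1": "B", "q2": "A", "q3": "B"},
}
inferred, accuracy = aggregate_labels(labels)
```

Workers who often disagree with the inferred labels end up with a lower estimated accuracy and therefore less influence in later rounds, which mirrors the source reliability idea in truth discovery.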


Truth discovery is important to big data and social media
analysis, where noisy information is inevitable. Besides the
above applications, truth discovery has great potential to
benefit more applications, such as aggregating GPS data
from crowds for traffic control, distilling news from social
media, and grading homework automatically for massive
open online courses.


7. RELATED AREAS 

From a broader view, there are some research topics in the
area of aggregation and fusion that are relevant to truth
discovery. These research topics include multi-view
learning/clustering, rank aggregation, sensor data fusion, meta
analysis, and ensemble learning. However, the problem
settings of these research topics are different from that of truth
discovery. Specifically, in multi-view learning/clustering
[8, 11, 73], the input is data of various feature sets (views)
and the task is to conduct classification or clustering on
multiple views jointly. Rank aggregation [23, 33] has a
different input/output space from truth discovery, as it focuses
on the aggregation of ranking functions. In sensor data
fusion [p6, 41], the source reliability is not considered and
usually all the sources are treated indistinguishably. In
meta analysis [13, 34], different lab studies are combined
via weighted aggregation, but the source weight is derived
mostly based on the size of the sample used in each lab study.
In ensemble learning [54, 82], the weights of base models are
inferred through supervised training, whereas truth
discovery is unsupervised. Compared with these research topics,
truth discovery methods aggregate data collected from
various sources on the same set of features on the same set
of objects, and the goal of truth discovery is to resolve the
conflicts among multiple sources by automatically
estimating source reliability based on multi-source data only.
Another relevant research area is information
trustworthiness analysis. First, in some trustworthiness scoring
systems, various methods have been developed for evaluating
the reputation or trustworthiness of users, accounts or nodes
in the context of social media platforms [43, 61] or sensor
networks [5, 62]. However, these systems focus on the
computation of trustworthiness or reliability degrees by
considering the content and/or links among sources. Second, in
trust-enhanced recommendation systems [63], the
recommendation for one user on one item can be computed
by a weighted combination of ratings from other users on
this item, where weights incorporate trust between users.
In such systems, trust and recommendation are derived
separately from different types of data: trust is derived based
on users' provided trust values about others or the
recommendation history information, while recommendation is
conducted based on current user-item ratings and the
derived trust. Compared with these studies, truth discovery
tightly combines the process of source reliability estimation
and truth computation to discover both of them from
multi-source data, and source reliability is usually defined as the
probability of a source giving true claims.
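
The weighted combination used in trust-enhanced recommendation can be sketched as follows. This is only an illustration of the general idea described above: the function name `trust_weighted_rating` and the normalized weighted average are assumptions, and trust values are taken as given inputs rather than estimated from the ratings.

```python
def trust_weighted_rating(trust, ratings):
    """Hypothetical sketch of a trust-weighted prediction for one user and item.

    trust:   {other_user: non-negative trust the target user places in them}
    ratings: {other_user: that user's rating of the item}
    """
    weighted_sum = 0.0
    total_trust = 0.0
    for other_user, rating in ratings.items():
        weight = trust.get(other_user, 0.0)  # users without a trust value are ignored
        weighted_sum += weight * rating
        total_trust += weight
    if total_trust == 0.0:
        return None  # no trusted user has rated this item
    return weighted_sum / total_trust


trust = {"bob": 0.9, "carol": 0.1}
ratings = {"bob": 5.0, "carol": 1.0, "dave": 3.0}
predicted = trust_weighted_rating(trust, ratings)
```

Here the weights are fixed before aggregation, which is the separation the paragraph points out; in truth discovery the weights and the aggregated values are estimated together from the same multi-source data.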

















8. CONCLUSIONS 

Motivated by the strong need to resolve conflicts among
multi-source data, truth discovery has gained more and more
attention recently. In this survey paper, we have discussed
the general principle of truth discovery methods, and
provided an overview of current progress on this research topic.
Under five general aspects, most of the components of truth
discovery have been examined. As the existing truth
discovery approaches have different assumptions about input
data, constraints, and the output, they have been
compared and summarized in Tables 2 and 3. When
choosing a truth discovery approach for a particular task, users
and developers can refer to this comparison as a guideline.
We have also discussed some future directions of truth
discovery research. More effort is needed to
explore the relations among objects, which will greatly benefit
real-world applications such as knowledge graph
construction. Furthermore, the efficiency issue becomes a
bottleneck for the deployment of truth discovery on large-scale
data. In addition, evaluating the performance or validating
the identified truths is a big challenge because only
limited groundtruth is available in practice.